Investigation into the Relationship between GDP and Child Mortality Rates over Time

STAT 331 Final Project

Author

Alex Sullivan, Megan Kerner, Isabelle Phraner, Vi-Linh Vu

Project Proposal - Data

Reading in Data

Code
# | code-fold: true
library(tidyverse)
library(here)
library(broom)
library(purrr)
library(knitr)
library(gganimate)
child_mortality <- read_csv("child_mortality_0_5_year_olds_dying_per_1000_born.csv")
GDP_per_Cap <- read_csv("gdp_pcap.csv")

1 Data Description and Cleaning Process

Our analysis explores the relationship between child mortality and GDP per capita across various countries over time. We utilize two primary datasets:

  1. Child Mortality Dataset: Combines data from multiple sources including Gapminder (v7) for years 1800 to 1950, UNIGME (a collaboration among UNICEF, WHO, the UN Population Division, and the World Bank) for years 1950 to 2016, and UN POP (World Population Prospects 2019) for projections up to 2100.

  2. GDP per Capita Dataset: Aggregates GDP per capita data from the World Bank, the Maddison Project Database, and the Penn World Table, offering a comprehensive view of economic performance across nations over time.

To ensure the relevance and accuracy of our analysis, we confined our examination to the period between the 1950s and 2020. A few factors motivated this decision. First, due to timelines of independence and colonial rule, many of the countries in the dataset did not formally exist until the 1950s or 1960s, and were instead territories of world powers such as England. Second, by refining our time frame, we are able to analyze historical trends without projected data beyond 2020 skewing our results. Having more consistent data also allows us to minimize the impact of NA values.


Code
# | code-fold: true
# Refine child_mortality dataset
child_mortality <- child_mortality |>
  select(country, `1950`:`2020`)

# Refine GDP_per_Capita dataset
GDP_per_Cap <- GDP_per_Cap |>
  select(country, `1950`:`2020`)

child_mortality_long <- child_mortality |>
  pivot_longer(
    cols = !country,
    names_to = "Year",
    values_to = "Mortality Rate"
  )

GDP_per_Cap_long <- GDP_per_Cap |>
  pivot_longer(
    cols = !country,
    names_to = "Year",
    values_to = "GDP Per Capita"
  )

# function converts numbers in "XXk" format to numeric
# eg "89k" -> 89,000
convert_k_to_number <- function(x) {
  if (grepl("k", x, fixed = TRUE)) {
    numeric_value <- as.numeric(sub("k", "", x)) * 1000
  } else {
    numeric_value <- as.numeric(x)
  }
  return(numeric_value)
}

# converts mortality rate to numeric
child_mortality_long <- child_mortality_long |>
  filter(!is.na(child_mortality_long$`Mortality Rate`)) |>
  mutate(`Mortality Rate` = as.numeric(`Mortality Rate`))

# converts gdp to numeric
GDP_per_Cap_long <- GDP_per_Cap_long |>
  mutate(`GDP Per Capita` = sapply(`GDP Per Capita`, convert_k_to_number))

joined_df <- inner_join(child_mortality_long, GDP_per_Cap_long, join_by(country, Year)) |>
  mutate(Year = as.numeric(Year))

We pivoted the data to be longer so it can be easily analyzed (on mortality rate and GDP per capita for each year). Then, we converted the values to be numeric as they represent quantitative data. For the GDP, we had to convert some values that had a “k” in it to the actual multiple of 1000. The final dataset has 195 different countries, spanning across 71 years.

1.1 Variable Description

The first variable in our joined data set is “country”, which specifies the country that the following data describes. The next variable is “Year”, which specifies the year in which the following data was collected. In our data, we limited our “Year” variable from 1950-present because many countries did not exist and therefore had missing data prior to then. The “Mortality Rate” variable represents the number of deaths of children under 5 years old per 1000 live births. Finally, the “GDP Per Capita” variable represents the gross domestic product per person. The data for “GDP Per Capita” was also adjusted for differences in purchasing power.

According to the International Monetary Fund, GDP is a monetary measure of the market value of all the final goods and services produced in a specific time period by a country or countries. GDP is often used by the government of a single country to measure its economic health.

Our hypothesis predicts a negative correlation between GDP per capita and child mortality rates, based on the assumption that higher economic status, as indicated by GDP per capita, is associated with enhanced access to healthcare, improved living conditions, and higher investment in public health infrastructure. These conditions are expected to contribute to lower child mortality rates, reflecting better health outcomes in economically prosperous countries.

We also expect a significant decrease in child mortality rates as time progresses, echoing global health advancements and the intensification of international efforts to combat child mortality. This expectation is supported by historical data, including reports from the World Health Organization, which indicate a decline of over 59% in child mortality rates since 1990. Such trends underscore the impact of both economic development and targeted health interventions in improving child survival rates.

2 Linear Regression

In our study, we employ linear regression to examine the potential impact of GDP per capita on child mortality rates. Linear regression is an appropriate statistical method for this analysis as it allows us to model the relationship between a continuous response variable and one or more explanatory variables. We hypothesize that GDP per capita (explanatory variable, x) inversely affects the child mortality rate (response variable, y^​), with the expectation that higher GDP per capita is associated with lower child mortality rates. In order to establish whether linear regression was appropriate, we revisited our LINE assumptions:

We made two graphs to examine and visualize the relationship between GDP per Capita and Child Mortality Rate. We want to see how GDP per Capita (the explanatory variable) potentially affects the Child Mortality Rate (the response variable).

As a preliminary check, we created two scatterplots to visually assess the relationship between GDP per capita and the Child Mortality Rate over time. Our aim was to discern the form, direction, strength, and presence of any unusual observations in the relationship between these two variables.

Code
# | code-fold: true
year_avgs <- joined_df |>
  group_by(country) |>
  summarise(
    Avg_Mortality_Rate = mean(`Mortality Rate`, na.rm = TRUE),
    Avg_GDP_Per_Capita = mean(`GDP Per Capita`, na.rm = TRUE)
  )

2.1 Visualizations

Code
# | code-fold: true
year_avgs |>
  ggplot(aes(x = Avg_GDP_Per_Capita, y = Avg_Mortality_Rate)) +
  geom_point(alpha = 0.5, color = 'blue') +
  theme_minimal() +
  labs(x = "Average GDP Per Capita", y = "Average Mortality Rate",
       title = "Average Mortality Rate vs Average GDP Per Capita by Country")

The scatterplot of average GDP per capita against average mortality rate shows a general downward trend, suggesting a negative correlation where higher GDP per capita is associated with lower child mortality rates. The relationship does exhibit a non-linear form, with the strength of the relationship varying across the range of GDP values. Notably, some countries with similar GDP per capita exhibit a wide range of mortality rates, indicating other factors may influence mortality beyond GDP alone.

Code
# | code-fold: true
joined_df |>
  mutate(Decade = floor(Year / 10) * 10) |>
  ggplot(aes(x = `GDP Per Capita`, y = `Mortality Rate`)) +
  geom_point(alpha = 0.25, color = 'purple') +
  facet_wrap(~ Decade, scales = "free") +
  labs(title = "GDP Per Capita vs Child Mortality Rate Over Time",
       x = "GDP Per Capita",
       y = "Mortality Rate") +
  theme_minimal()

Code
joined_df |>
  mutate(Decade = floor(Year / 10) * 10) |>
  ggplot(aes(x = `GDP Per Capita`, y = `Mortality Rate`)) +
  geom_point(alpha = 0.25, color = 'purple') +
 labs(title = 'Year: {frame_time}', x = 'GDP Per Capita', y = 'Mortality Rate') +
  transition_time(Year) +
  ease_aes('linear')

The second graph plots the GDP per capita per against the child mortality rate and is faceted by decade in order to see how the relationship between the two variables change over time. At a glance, it does not appear that the relationship between the two variables change drastically over the decades as all of the grids have a similar concave shape. While the overall shape of the relationship between the variables is similar across decades, it’s evident that the actual numerical values do vary over time, as indicated by the differing scales on each graph.

In evaluating the appropriateness of the model, we revisit out LINE assumptions;
Linearity: The relationship between GDP per capita and child mortality rate appears linear, Independence: Observations are independent of each other, Normality: The residuals of the model are normally distributed, Equal Variance: The variance of the error terms is constant across values of GDP per capita.

Based on our scatter plots so far, the model does not appear to be very linear, and instead exhibits the shape of either a logarithmic function or of exponential decay. We can however confirm that the data are independent: it’s unlikely that American child mortality has an effect on Angolan or Brazilian child mortality. Based on trade relationships though, it is possible that American, British, and German GDPs may have relationships with each other.

2.2 Regression and Model Fit

Code
# | code-fold: true
mortalitygdp_lm <- lm(Avg_Mortality_Rate ~ Avg_GDP_Per_Capita,
  data = year_avgs)

mortality_regression <- tidy(mortalitygdp_lm)

kable(mortality_regression, caption = "GDP Per Capita Regression Model")
GDP Per Capita Regression Model
term estimate std.error statistic p.value
(Intercept) 128.4115700 5.3256071 24.112100 0
Avg_GDP_Per_Capita -0.0025385 0.0002628 -9.658573 0

Based on the results of our linear regression, we use the equation \[\hat{y} = 128 - 0.00254x\] to represent the relationship between GDP and child mortality. In this model, \(\hat{y}\) represents the predicted child mortality rate, \(x\) represents represents the GDP per capita, \(b0\) (intercept) indicates the child mortality rate when GDP per capita is zero, and \(b1\) (slope) represents the change in child mortality rate for a one-unit increase in GDP per capita.

We find that the average child mortality rate is 128.4 when GDP is zero. We also find that, holding other factors constant, a one-unit ($1000 USD) increase in GDP corresponds to the child mortality rate decreasing -0.0025 on average. It’s important to note that the interpretation of the slope coefficient involves units from both variables, confirming the validity of using different scales in our analysis.

Code
# | code-fold: true
aug <- broom::augment(mortalitygdp_lm)
variance_table <- data.frame(
  Variable = c("Response Values", "Fitted Values", "Residuals"),
  Variance = c(var(aug$Avg_Mortality_Rate), 
               var(aug$.fitted), 
               var(aug$.resid))
)
kable(variance_table, caption = "Variance Table") 
Variance Table
Variable Variance
Response Values 4715.485
Fitted Values 1536.559
Residuals 3178.926

The variance table derived from our model’s residuals allows us to calculate the R-squared using the formula \(R^2 = \frac{{\text{var}( \hat{y})}}{{\text{var}(y)}}\), with \(\hat{y}\) being the variance of the fitted values and \(y\) being the variance of the response values. Plugging these values into our formula, we get an \(R^2\) value of approximately 0.3258. This indicates that approximately 32.58% of the variance in child mortality rates is explained by GDP per capita. While this suggests a significant relationship, it also highlights the presence of other factors affecting child mortality that are not captured by GDP per capita alone. The measurement of child mortality rates was based on the probability of dying between birth and age 5 per 1,000 live births, providing a standardized metric for international health comparisons.

3 Simulation

3.1 Visualizing Simulations from the Model

Code
# 1. generate predictions using  predict() 
predictions <- predict(mortalitygdp_lm, newdata = year_avgs)

# 2. add random errors to the predictions w/ residual standard error (acquired with sigma())
std_error <- sigma(mortalitygdp_lm)
simulated_errors <- rnorm(n = length(predictions), mean = 0, sd = std_error)

# 3. plotting doom

simulated_year_avgs <- year_avgs|>
    mutate(Simulated_Mortality = predictions + simulated_errors)


# obs
observed_plot <- ggplot(year_avgs, aes(x = Avg_GDP_Per_Capita, y = Avg_Mortality_Rate)) +
  geom_point(aes(color = "Observed"), alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE, color = "blue", linetype = "dashed") +
  theme_minimal() +
  labs(title = "Observed Average Mortality Rate vs. Average GDP Per Capita",
       color = "Data Type")


# sim

simulated_plot <- ggplot(simulated_year_avgs, aes(x = Avg_GDP_Per_Capita, y = Simulated_Mortality)) +
  geom_point(aes(color = "Simulated"), alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE, color = "red", linetype = "dashed") +
  theme_minimal() +
  labs(title = "Simulated Average Mortality Rate vs. Average GDP Per Capita",
       color = "Data Type")


library(patchwork)

(observed_plot | simulated_plot) + 
  patchwork::plot_layout(guides = 'collect') & 
  theme(legend.position = 'bottom')

Code
# need to fix labels, maybe add another mutate column

The code above generates two side-by-side plots. On the left side, we have the observed average mortality rates plotted against average GDP per capita, while on the right side, we see the simulated average mortality rates plotted against average GDP per capita. Upon comparison, it’s evident that the simulated mortality rates exhibit greater variability compared to the observed values due to a larger spread of data points around the regression line in the simulated plot. Furthermore, when examining the relationship between mortality rates and GDP per capita, similarities can be observed for lower values of average GDP per capita in both plots. However, as average GDP per capita increases, a notable difference emerges between the observed and simulated mortality rates, as the observed mortality rates appear to be higher than the simulated rates. This discrepancy suggests that our prediction model may not fully capture the complexities of the relationship between mortality rates and GDP per capita, especially at higher levels of average GDP per capita.

3.2 Generating Multiple Predictive Checks

Code
sd <- sigma(mortalitygdp_lm)
noise <- function(x, mean = 0, sd){
  x + rnorm(length(x), 
            mean, 
            sd)
}

nsims <- 1000
sims <- map_dfc(.x = 1:nsims,
                .f = ~ tibble(sim = noise(predictions, 
                                          sd = sd)
                              ))
colnames(sims) <- colnames(sims) |> 
  str_replace(pattern = "\\.\\.\\.",
                  replace = "_")

sims <- year_avgs |> 
  filter(!is.na(`Avg_GDP_Per_Capita`)) |> 
  select(`Avg_Mortality_Rate`)|> 
  bind_cols(sims)



sim_r_sq <- sims |> 
  map(~ lm(Avg_Mortality_Rate ~ .x, data = sims)) |> 
  map(glance) |> 
  map_dbl(~ .x$r.squared)

sim_r_sq <- sim_r_sq[names(sim_r_sq) != "Avg_Mortality_Rate"]


tibble(sims = sim_r_sq) |> 
  ggplot(aes(x = sims)) + 
  geom_histogram(binwidth = 0.025) +
  labs(x = expression("Simulated"~ R^2),
       y = "",
       subtitle = "Number of Simulated Models") +
  theme_bw()

The graph represents the distribution of R squared for all linear regression with all of the simulated data sets. With a roughly normal distribution skewed slightly to the left, it suggests that while many models demonstrate moderate explanatory power, there’s notable variability in their effectiveness in capturing the relationship between average GDP per capita and average mortality rate. The distribution being centered around 0.1 indicates that, on average, the models explain approximately 10% of the variability in average GDP per capita. This implies that while the model captures a portion of the variation, there’s still substantial unexplained variability. Therefore, while the model provides insights into the relationship between the variables, there’s room for improvement to increase its predictive accuracy.